Method and apparatus for an adaptive codebook search

ABSTRACT

An adaptive codebook search (ACS) algorithm is based on a set of matrix operations suitable for data processing engines supporting a single instruction multiple data (SIMD) architecture. The result is a reduction in memory access and increased parallelism to produce an overall improvement in the computational efficiency of ACS processing.

CROSS-REFERENCES TO RELATED APPLICATIONS

[0001] NOT APPLICABLE

STATEMENT AS TO RIGHTS TO INVENTIONS MADE UNDER FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

[0002] NOT APPLICABLE

REFERENCE TO A “SEQUENCE LISTING,” A TABLE, OR A COMPUTER PROGRAM LISTING APPENDIX SUBMITTED ON A COMPACT DISK.

[0003] NOT APPLICABLE

BACKGROUND OF THE INVENTION

[0004] The present invention relates to speech processing in general, and more particularly to a speech encoding method and system based on code excited linear prediction (CELP).

[0005] FIG. 6 shows the conventional model for human speech production. The vocal cords are modeled by an impulse generator that produces an impulse train 602. A noise generator produces white noise 604 which models the unvoiced excitation component of speech. In practice, all sounds have a mixed excitation, which means that the excitation consists of voiced and unvoiced portions. This mixing is represented by a switch 608 for selecting between voiced and unvoiced excitation. An LPC filter 610 models the vocal tract through which the speech is formed as the air is forced through it by the vocal cords. The LPC filter is a recursive digital filter whose resonance behavior (frequency response) is defined by a set of filter coefficients. The computation of the coefficients is based on a mathematical optimization procedure referred to as linear prediction coding, hence “LPC filter.”

[0006] Code-excited linear prediction (CELP) is a speech coding technique commonly used for producing high quality synthesized speech at low bit rates, i.e., 4.8 to 9.6 kilobits-per-second (kbps). This class of speech coding, also known as vector-excited linear prediction, utilizes a codebook of excitation vectors to excite the LPC filter 610 in a feedback loop to determine the best coefficients for modeling a sample of speech. A difficulty of the CELP speech coding technique lies in the computationally intensive task of performing an exhaustive search of all the excitation code vectors in the codebook. The codebook search consumes roughly 60% of the total processing time of a speech codec (compression encoder-decoder).

[0007] The ability to reduce the computational complexity without sacrificing voice quality is important in the digital communications environment. Thus, a need exists for improved CELP processing.

SUMMARY OF THE INVENTION

[0008] A method and system for speech synthesis includes an adaptive codebook search (ACS) process based on a set of matrix operations suited for data processing engines which support one or more SIMD (single instruction multiple data) instructions. A set of matrix operations was determined which recasts the conventional standard algorithm for ACS processing so that a SIMD implementation not only achieves improved computational efficiency, but also reduces the number of memory accesses to realize improvements in CPU (central processing unit) performance.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] FIG. 1 shows a high level system block diagram of a speech synthesis system in accordance with an embodiment of the invention;

[0010] FIG. 1A shows a generalized block diagram of a typical hardware configuration of a speech synthesizer, incorporating aspects of the invention;

[0011] FIGS. 2A-2D illustrate the matrix operations in accordance with the invention;

[0012] FIGS. 3A-3C illustrate generalized matrix operations according to the teachings of the invention;

[0013] FIGS. 4A and 4B illustrate a high level discussion of a flow chart for performing the matrix operations shown in FIG. 3C;

[0014] FIGS. 5A and 5B illustrate a generalization of the matrix operations to include SIMD processing engines having n-way parallelism; and

[0015] FIG. 6 illustrates a conventional model of the human vocal tract.

DESCRIPTION OF THE SPECIFIC EMBODIMENTS

[0016] FIG. 1 shows a high level block diagram of a speech coder 100, embodying aspects of the present invention. The block diagram represents the functional aspects of a speech coder in accordance with a particular implementation standard, namely, G.723. It can be appreciated that other standards, such as G.728 and G.729, implement the same function, and even special purpose non-standard codecs can be built to implement similar functionality. An excitation signal 126 is fed as input to a synthesis filter 112. The excitation signal is chosen from a codebook of excitation sequences 118 commonly referred to as excitation code vectors. For each frame of speech, a codebook search process 102 selects an excitation signal and applies it to the synthesis filter 112 to generate a synthesized speech signal 106. The synthesized speech is compared 122 to the original input speech signal 104 to produce an error signal. The error signal is then weighted by passing it through a weighting filter 114 having a response based on human auditory perception. The weighted error signal is then processed by the error calculation block 116 (e.g., per G.723) to produce a residual excitation signal 108 (also referred to as a target vector signal).

[0017] The optimum excitation signal is determined in the codebook search process 102 by selecting the code vector which produces the weighted error signal representing the minimum energy for the current frame; i.e., the search through a codebook of candidate excitation vectors is performed on a frame-by-frame basis. Typically, the selection criterion is the sum of the squared differences between the original and the synthesized speech samples resulting from the excitation information for each speech frame, called the mean squared error (MSE).

[0018] Referring to the general architectural diagram of a speech synthesis system 140 of FIG. 1A, it can be appreciated that numerous specific implementations of the components shown in FIG. 1 are possible. A common implementation of the processing components (e.g., filter 112, search process 102, and so on) is on a digital signal processor (DSP), executing appropriately written code for the DSP. The processing components can be implemented on a PC (personal computer) platform executing one or more software components. Depending on performance requirements, the components might be implemented using multiple hardware processing units.

[0019] As shown in FIG. 1A, the processing component 152 includes a single instruction multiple data (SIMD) architecture which implements a SIMD instruction set. Generally, any SIMD engine can be used as the processing component; the component is not limited to conventional processors. Thus, for example, a custom ASIC that supports at least a SIMD multiply and accumulate instruction can be used.

[0020] The speech coder can utilize various storage technologies. A typical storage (memory) component 154 of the system can include conventional RAM (random access memory) and hard disk storage. The program code that is executed can reside wholly in a RAM component, or portions may be stored in RAM and/or a cache memory and other portions on a hard drive as is commonly done in modern operating system (OS) environments. The program code can be stored in firmware. The codebook might be stored in some form of non-volatile memory. Other implementations can include ASIC-microcontroller combinations, and so on.

[0021] A signal converter 156 is typically included to convert the analog speech-in signal to a suitable digital format, and conversely an analog speech-out signal can be produced by converting the digital data. The SIMD-based processor 152 can include one or more control signals 166 which are communicated to operate the signal converter. Data channels 162 and 164 can be provided as data paths among the various components.

[0022] The speech synthesis system 140 can be any system that utilizes speech synthesis or otherwise benefits from speech synthesis. Examples include mobile devices supporting voice communication such as video conference systems, audio recorders, dictaphones, voice mail boxes, order processing systems, security, and intercom systems. These devices typically require real time processing capability, have limits on power consumption, and have limited processing resources. Further, most current day fixed point application processors have SIMD extensions. The present invention uses the SIMD architecture to reduce the computational load on the data processing component 152. Hence devices can operate in a lower power mode. Mail boxes and dictaphones having limited processing resources use uncompressed voice transactions. These devices can be replaced by the voice codecs using compression technology, thereby increasing the efficiency of storage. Existing mobile phones and conference systems make use of CELP based voice codecs. The present invention frees up the processor to perform additional functions, or simply to save power. Most existing analog voice applications such as intercom/security systems will be eventually replaced by digital systems with content compression for better resource usage, and thus would be well suited for use with the present invention.

[0023] The calculation which takes place in the codebook search process 102 involves computing the convolution of each excitation frame stored in the codebook with the perceptual weighted impulse response. Calculations are performed by using vector and matrix operations of the excitation frame and the perceptual weighting impulse response. The calculation includes performing a particular set of matrix computations in accordance with the invention to compute a correlation vector representing the correlation between the target vector signal 108 and an impulse response.

[0024] As mentioned above, adaptive codebook search involves searching for a codebook entry that minimizes the mean square error between the input speech signal and the synthesized speech. It can be shown (per the G.723.1 ITU specification) that the computation of MSE can be reduced to an equation whose “maximum” represents the best codebook entry to be selected: $\mathrm{MaxVal} = \frac{\left( d^{T} v_{i} \right)^{2}}{v_{i}^{T} \varphi\, v_{i}},$

[0025] where

[0026] i is an index into the codebook,

[0027] v_(i) is the excitation vector at index i,

[0028] φ=H^(T)H,

[0029] d=H^(T)R,

[0030] R is the target vector signal, and

[0031] H is the impulse response of the synthesis filter 112 (FIG. 1).

[0032] The quantity d represents the correlation between the target vector signal R and the impulse response H. The quantity d is defined by: $d = \sum_{n = j}^{FrmSz} R[n] \cdot H[n - j],$

[0033] where FrmSz is the frame size, e.g., 59 frames, and 0≦j≦FrmSz.

[0034] The quantity φ represents the covariance matrix of the impulse response: $\varphi = \sum_{n = j}^{FrmSz} H[n - i] \cdot H[n - j].$
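The definitions of d, φ, and the metric given above can be sketched in C as follows. This is a minimal illustration only, not the reference implementation: the frame length N, the use of double-precision arithmetic, the function names, the summation upper bound of N−1, and the lower bound max(i, j) for φ (chosen so that no negative index into the impulse response occurs) are all assumptions made for the example.

```c
#include <stddef.h>

#define N 4  /* hypothetical frame length, kept tiny for illustration */

/* d[j] = sum over n = j..N-1 of R[n] * h[n-j]  (assumed upper bound N-1) */
static void correlation(const double R[N], const double h[N], double d[N])
{
    for (size_t j = 0; j < N; j++) {
        d[j] = 0.0;
        for (size_t n = j; n < N; n++)
            d[j] += R[n] * h[n - j];
    }
}

/* phi[i][j] = sum over n = max(i,j)..N-1 of h[n-i] * h[n-j] (assumed bounds) */
static void covariance(const double h[N], double phi[N][N])
{
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            size_t start = (i > j) ? i : j;
            phi[i][j] = 0.0;
            for (size_t n = start; n < N; n++)
                phi[i][j] += h[n - i] * h[n - j];
        }
    }
}

/* Metric for one candidate excitation vector v: (d^T v)^2 / (v^T phi v). */
double acs_metric(const double R[N], const double h[N], const double v[N])
{
    double d[N];
    double phi[N][N];
    double num = 0.0;
    double den = 0.0;

    correlation(R, h, d);
    covariance(h, phi);

    for (size_t i = 0; i < N; i++)
        num += d[i] * v[i];
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            den += v[i] * phi[i][j] * v[j];

    /* Guard against an all-zero candidate; returning 0 here is an assumption. */
    return (den > 0.0) ? (num * num) / den : 0.0;
}
```

In a real coder, d and φ would be computed once per frame and reused for every candidate vector; they are recomputed inside `acs_metric` here only to keep the sketch short.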

[0035] For each excitation vector v_(i), a metric MaxVal_(i) is computed. Each excitation vector therefore has an associated MaxVal_(i). A minimum value of the metric is determined and the vector associated with that metric is deemed to be the entry that minimizes the mean square error.

[0036] FIGS. 2A-2D illustrate a procedure for computing the correlation quantity d according to the teachings of the present invention. First, a brief discussion of a conventional implementation for computing the correlation quantity is presented.

[0037] The equation for d for a speech codec (coder/decoder) per the ITU (International Telecommunication Union) reference ‘C’ implementation is expressed as: $\sum_{i = 0}^{FrmSz} \sum_{j = 0}^{i} \left( \mathrm{RzBf}[\mathrm{pitch} - 1 + j] \times \mathrm{ImpRes}[i - j] \right),$

[0038] where

[0039] RzBf is the residual excitation buffer (i.e. the target vector signal),

[0040] ImpRes is the impulse response buffer, and

[0041] pitch is a constant.

[0042] A typical scalar implementation of this expression is shown by the following C-language code fragment:

    for ( i = 0 ; i < SUB_FRAME_LENGTH ; i++ ) {
        Acc0 = (Word32) 0 ;
        /* inner sum runs over j = 0..i, matching the expression above */
        for ( j = 0 ; j <= i ; j++ ) {
            Acc0 = saturate( Acc0 + RezBuf[CL_PITCH_ORD-1+j] * ImpResp[i-j] ) ;
        }
        /* store the rounded result for output sample i */
        FltBuf[CL_PITCH_ORD-1][i] = round( Acc0 ) ;
    }

[0043] The ‘saturate( )’ function or some equivalent is commonly used to prevent overflow.

[0044] A line-by-line statistical profiling of a conventional adaptive codebook search algorithm indicates that the foregoing implementation for computing the correlation quantity d consumes about one third of the total processing time in a speech codec.

[0045] It was discovered that a decomposition of the expression: $\sum_{i = 0}^{FrmSz} \sum_{j = 0}^{i} \left( \mathrm{RzBf}[\mathrm{pitch} - 1 + j] \times \mathrm{ImpRes}[i - j] \right),$

[0046] can be produced that reduces the computational load for computing the correlation quantity. More specifically, it was discovered that a certain combination of matrix operations can be obtained which is readily implemented using a SIMD instruction set. Moreover, the instructions can be coded in a way that reduces the number of accesses between main memory and internal registers in a processing unit.

[0047] Referring now to FIGS. 2A-2D, a set of matrix operations is shown for an iteration of the above nested summation operation. Here, the following notational conventions will be adopted:

[0048] I[ ] is the vector ImpRes[ ], where a vector element is referenced as I_(i),

[0049] R[ ] is the vector RzBf[ ], where a vector element is referenced as R_(i), and

[0050] F[ ] is an output vector FltBuf[ ] to store the result of the operation and thus is representative of the correlation quantity d, where a vector element is referenced as F_(i).

[0051] In accordance with the invention, the first four elements of F[ ] (F₀-F₃) can be expressed by the matrix operation shown in FIG. 2A. The next four elements of F[ ] (F₄-F₇) can be expressed by the matrix operations shown in FIGS. 2B and 2C. A constituent component of elements F₄-F₇ is intermediate vector F′[ ] which is determined by the operation shown in FIG. 2B. This matrix operation represents the computation which occurs at the end of the series RzBf[pitch−1+j]×ImpRes[i−j].

[0052] Another constituent component of elements F₄-F₇ is intermediate vector F″[ ] which is determined by the operation shown in FIG. 2C. This matrix operation represents the computations which occur in the middle of the series RzBf[pitch−1+j]×ImpRes[i−j].

[0053] As can be seen in FIG. 2D, the elements F₄-F₇ of F[ ] can be determined as the sum of F′[ ] and F″[ ].

[0054] The matrix operations shown in FIGS. 2B and 2C lead to a generalized set of computational operations to perform the entire computation of the correlation quantity d. This can be seen with reference to the generalized matrix operations shown in FIGS. 3A-3C.

[0055] Every four elements in F[ ] (e.g., F₄-F₇, F₈-F₁₁, F₁₂-F₁₅, etc.) can be determined by computing every four elements of its constituent intermediate vectors, F′ and F″.

[0056] FIG. 3A represents the generalized form for the matrix operation shown in FIG. 2B for computing the intermediate vector F′ for the entire vector F[ ], four elements at a time. The generalized form includes an index n, which is incremented by four for each set of four elements in the intermediate vector F′.

[0057] FIG. 3B represents the generalized form for the matrix operation shown in FIG. 2C for computing the intermediate vector F″ for the entire vector F[ ], four elements at a time. This operation involves a summation operation because it occurs in the middle of the series RzBf[pitch−1+j]×ImpRes[i−j]. The notation in the summation: $\sum_{\substack{l = 0,\ \text{step } +4 \\ m = n + 3,\ \text{step } -4}}^{(m - 6) > 0}$

[0058] indicates that the index l begins at zero and increments by four. The index m begins at (n+3) and decrements by four. The summation stops when (m−6)≦0.
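The index pattern just described maps directly onto a two-variable loop. The following C sketch, which assumes nothing beyond the start values, step sizes, and stopping rule stated above, records the (l, m) pairs visited for a given n; the function name and output-array interface are hypothetical and exist only for illustration.

```c
/* Records the (l, m) index pairs visited by the inner summation for a
 * given n. Returns the number of pairs written (at most max_pairs). */
int inner_sum_indices(int n, int l_out[], int m_out[], int max_pairs)
{
    int count = 0;

    /* l starts at 0 and steps by +4; m starts at n+3 and steps by -4;
     * the summation continues only while (m - 6) > 0. */
    for (int l = 0, m = n + 3; (m - 6) > 0 && count < max_pairs; l += 4, m -= 4) {
        l_out[count] = l;
        m_out[count] = m;
        count++;
    }
    return count;
}
```

For n = 4 the loop runs once with (l, m) = (0, 7), and for n = 8 it runs twice with (0, 11) and (4, 7), so the number of inner terms grows by one for each additional group of four outputs.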

[0059] FIG. 3C shows the generalized form for computing the entire vector F[ ]. Expressed in pseudo code format, it can be seen that the operation 302 computes the first four elements of F[ ]. The operation 304 computes the remaining elements of F[ ], four elements at a time. The term SubFormSz refers to the number of samples in a subframe.

[0060] In accordance with various implementations of the embodiments of the present invention these operations are implemented in a computer processing architecture that supports a SIMD instruction set. A commonly provided instruction is the “multiply and accumulate” (MAC) instruction, which performs the operation of multiplying two operands and summing the product to a third operand. A generic MAC instruction might be:

MAC %1, %2, %3,   %3 ← %3 + (%1 × %2)

[0061] where %1, %2, and %3 are the register operands.

[0062] In a SIMD architecture, the MAC instruction performs the operation simultaneously on multiple sets of data. Typically, the registers used by a SIMD machine can store multiple data. For example, a 64-bit register (e.g., %1) can contain four 16-bit data (e.g., %1₀, %1₁, %1₂, and %1₃) to provide what will be referred to as “4-way parallel” SIMD architecture. Thus, execution of the foregoing MAC instruction would perform the following operations in a 4-way SIMD machine:

%3₀←%3₀+(%1₀×%2₀)

%3₁←%3₁+(%1₁×%2₁)

%3₂←%3₂+(%1₂×%2₂)

%3₃←%3₃+(%1₃×%2₃)
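The four lane-wise operations above can be modeled in portable C as a loop over the lanes of a packed register. This is only a behavioral sketch: the `quad16` and `quad_acc` types and the `simd_mac4` name are hypothetical, and the use of 32-bit accumulator lanes is an assumption made here so that the 16-bit products do not overflow. On a SIMD machine the four iterations would execute as one instruction.

```c
#include <stdint.h>

/* Four 16-bit words, standing in for one 64-bit SIMD register. */
typedef struct { int16_t lane[4]; } quad16;

/* Four accumulator lanes (32-bit width is an assumption of this sketch). */
typedef struct { int32_t lane[4]; } quad_acc;

/* Models the generic 4-way MAC: acc[k] <- acc[k] + (a[k] * b[k]) for k = 0..3. */
void simd_mac4(const quad16 *a, const quad16 *b, quad_acc *acc)
{
    for (int k = 0; k < 4; k++)
        acc->lane[k] += (int32_t)a->lane[k] * (int32_t)b->lane[k];
}
```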

[0063] Typically, a SIMD instruction set comprises a full complement of instructions for all math and logical operations, and for memory load and store operations. Specific instruction formats will vary from one manufacturer of processing unit to another. However, the same ideas of parallel operations are common among them.

[0064] FIGS. 4A and 4B show the process flow for performing the operations shown in FIG. 3C. The SH5 SIMD instruction set is used merely to provide a context for explaining the figures. The SH5 instruction set supports 4-way parallel instructions. A copy of the programmer user's manual describing the SH5 instruction set is contained on a compact disc in a PDF-formatted file. In this particular implementation in accordance with an embodiment of the invention, vector elements (R[ ], I[ ], and F[ ]) are word-sized 16-bit data. It can be appreciated of course that other word sizes are possible. The registers are 64 bits wide. For the following discussion of FIGS. 4A and 4B, the vector F[ ] is represented by output vector Ynxt[ ].

[0065] The processing in FIG. 4A includes a step 402 of loading a quad word from memory area 154 a in the memory component (FIG. 1A) from the vector R[ ] (pointed to by ptrRend, initially set to point to the beginning of the vector R[ ]). Each quad word represents four elements of a vector. Thus, four elements (quad-word) from the vector R[ ] are loaded into a (64-bit) register R_(end) 152 c, and are identified generically as (r0, r1, r2, r3) without reference to any specific four elements.

[0066] In a step 404, the quad words contained in the register R_(end) are copied to an intermediate register 152 e to produce the following intermediate quad words: (0, 0, 0, r0), (0, 0, r0, r1), (0, r0, r1, r2), and (r0, r1, r2, r3). Each intermediate quad word is combined in a MAC (multiply and accumulate) operation with another intermediate register 152 f which contains the first four words (I0, I1, I2, I3) from the impulse response vector I[ ]. Thus, in a MAC operation (step 406 a), the output for y0 is computed:

y0 = 0×I₃ + 0×I₂ + 0×I₁ + r0×I₀.

[0067] Similarly in subsequent MAC operations (steps 406 b-406 d), the following are computed:

y1 = 0×I₃ + 0×I₂ + r0×I₁ + r1×I₀,

y2 = 0×I₃ + r0×I₂ + r1×I₁ + r2×I₀,

y3 = r0×I₃ + r1×I₂ + r2×I₁ + r3×I₀.
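The four outputs y0-y3 above can be reproduced with a short C sketch that builds each zero-padded quad word and takes its sum of products with the reversed impulse quad (I₃, I₂, I₁, I₀). The function names are hypothetical, and modeling each MAC step as a four-term dot product accumulated in 32 bits is an assumption of this sketch rather than a statement about any particular instruction set.

```c
#include <stdint.h>

/* Sum of products of two quad words, modeling one multiply-and-sum step. */
static int32_t quad_dot(const int16_t a[4], const int16_t b[4])
{
    int32_t acc = 0;
    for (int k = 0; k < 4; k++)
        acc += (int32_t)a[k] * (int32_t)b[k];
    return acc;
}

/* Computes y[k] = sum over j = 0..k of r[j] * imp[k-j], for k = 0..3,
 * using the shifted quad words (0,0,0,r0), (0,0,r0,r1), and so on. */
void boundary_block(const int16_t r[4], const int16_t imp[4], int32_t y[4])
{
    /* Reversed impulse quad: (I3, I2, I1, I0). */
    const int16_t irev[4] = { imp[3], imp[2], imp[1], imp[0] };

    for (int k = 0; k < 4; k++) {
        int16_t q[4] = { 0, 0, 0, 0 };
        /* Right-align r0..rk in the quad, leaving leading zeros. */
        for (int j = 0; j <= k; j++)
            q[3 - k + j] = r[j];
        y[k] = quad_dot(q, irev);
    }
}
```

With r = (1, 2, 3, 4) and I = (1, 10, 100, 1000), the outputs are 1, 12, 123, and 1234, which makes the contribution of each r element to each output easy to read off.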

[0068] The outputs of the MAC operations are stored in registers used by the SIMD engine 152 (FIG. 1A).

[0069] In a step 408, the contents of the registers containing the outputs y0-y3 are written to the output vector Ynxt[ ] in a memory area 154 b in the memory component 154, pointed to by a pointer ptrYnxt which initially points to the beginning of the vector.

[0070] Next, various pointers are updated in a step 410 in preparation for the subsequent operations. The pointer ptrRend is incremented by four. A pointer ptrInxt is copied to ptrIcur. A pointer ptrRnxt is set to the beginning of R[ ]. The ptrYnxt is incremented by four.

[0071] Note that by setting the pointers ptrRend to the beginning of the vector R[ ] and ptrYnxt to the beginning of vector Ynxt[ ], the very first iteration through the foregoing steps produces the boundary condition computation shown in FIG. 3C as operation 302. After the update step 410, the pointers are properly adjusted to perform the operation 304, the processing of which is shown in FIG. 4B. As can be appreciated, subsequent iterations through the foregoing steps produce the boundary condition computation identified as 304 a in FIG. 3C.

[0072] The processing in FIG. 4B includes a step 412 of loading a quad word from areas 154 a in the memory component 154 (FIG. 1A) that store the vectors R[ ] and I[ ]. Thus, four elements from the vector R[ ] beginning at a location pointed to by a pointer ptrRnxt are loaded into a register R_(nxt) 152 a, and are identified generically as (r0, r1, r2, r3). Four elements from the impulse response vector I[ ] in memory area 154 a, beginning at a location pointed to by a pointer ptrInxt, are similarly loaded into another register I_(nxt) 152 b. However, an operation to reverse the order of the four elements from I[ ] is first performed in a step 412 a to store the data referred to generically as (n3, n2, n1, n0).

[0073] Next, in a step 414, the data (n3, n2, n1, n0) in the I_(nxt) register 152 b and the data (p3, p2, p1, p0) in another register I_(prv) 152 c are manipulated to produce combinations of quad words stored in an intermediate register 152 d, in preparation for a set of MAC operations (step 416). Thus, in a step 416 a, a MAC operation between the R_(nxt) register 152 a and the intermediate register 152 d containing the packed quad-word (n0, p3, p2, p1) produces the output y0 defined as:

y0 = r0×n0 + r1×p3 + r2×p2 + r3×p1

[0074] Similar operations are performed in steps 416 b-416 d, to produce outputs y1-y3 respectively. The outputs y0-y3 are also held in registers used by the SIMD engine 152 (FIG. 1A). In a step 418, the outputs are written to the vector Ynxt[ ].
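One way to picture steps 414 and 416 is as a four-element window sliding across the eight impulse words held in the previous and next quads. The C sketch below is an illustration under stated assumptions: `prv` and `nxt` hold (p0, p1, p2, p3) and (n0, n1, n2, n3) in natural order, the outputs accumulate into 32-bit values, and the windows for y1-y3 are inferred by continuing the pattern of the y0 window (n0, p3, p2, p1). The function name is hypothetical.

```c
#include <stdint.h>

/* Accumulates one inner-sum step into y[0..3].
 * prv = (p0, p1, p2, p3) and nxt = (n0, n1, n2, n3) are consecutive quads
 * of the impulse response; r = (r0, r1, r2, r3) is a quad of R[ ]. */
void inner_block(const int16_t r[4], const int16_t prv[4],
                 const int16_t nxt[4], int32_t y[4])
{
    int16_t w[8];

    /* Concatenate the two quads: w = (p0, p1, p2, p3, n0, n1, n2, n3). */
    for (int k = 0; k < 4; k++) {
        w[k] = prv[k];
        w[4 + k] = nxt[k];
    }

    /* Output k pairs r0..r3 with the window (w[4+k], w[3+k], w[2+k], w[1+k]);
     * for k = 0 this is (n0, p3, p2, p1), as in the expression for y0. */
    for (int k = 0; k < 4; k++)
        for (int j = 0; j < 4; j++)
            y[k] += (int32_t)r[j] * (int32_t)w[4 + k - j];
}
```

Because only `nxt` is new in each step, `prv` can stay in a register from the preceding iteration, which is the reuse that the register copy in step 420 provides.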

[0075] Registers are updated in a step 420 in preparation to continue the inner sum operation. Thus, the contents of the I_(nxt) register are copied to the I_(prv) register because in the next iteration the current contents of I_(nxt) become the “previous” contents. Various pointers to the vectors in the memory 154 are updated. A pointer ptrRnxt is incremented by 4, as is the pointer ptrYnxt. A pointer ptrInxt is decremented by four.

[0076] A test is performed in a step 401 to determine if the lower limit of the impulse vector I[ ] is exceeded. Step 401 checks whether the pointer ptrInxt has been decremented beyond this lower limit. The lower limit is defined in the generalized inner sum operation 304 b (FIG. 3C) for the index m. If the lower limit is not exceeded, then the operation repeats with step 412, as indicated by the connector A. If the lower limit is exceeded, then the inner sum operation is complete. A pointer ptrRend (see FIG. 4B) is checked to determine if the end of the vector R[ ] is reached. If not, then the operation repeats with step 402 on FIG. 4A, as indicated by the connector B.

[0077] Referring to FIGS. 3A & 3B and 4A & 4B, it can be appreciated that the matrix operations according to the invention allow for a reduction of memory access requirements, thus saving on valuable CPU cycles. The operations provide for reuse of data already retrieved for other operations. The shaded areas 312 a-312 c shown in FIGS. 3A and 3B (see also 212 a-212 d in FIGS. 2A-2C) represent data previously retrieved from memory 154. Thus, the matrix operation shown in FIG. 3A involves a memory fetch of the four words for R_(n)-R_(n+3), shown in the unshaded area. The SIMD MAC operation can then be applied to perform the indicated matrix operation. Note from FIG. 4A that the first four elements of the impulse vector I[ ] are always used, so they will have been pre-loaded into a register at the very beginning of the matrix operations.

[0078] Similarly, the matrix operation shown in FIG. 3B lends itself to reusing pre-fetched data in a SIMD architecture. The vector I[ ] elements I_(m−6)-I_(m−3) are stored as previously fetched elements so that the inner sum of products operation requires only one fetch operation from memory 154 to retrieve the quad words constituting elements I_(m−3)-I_(m).
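The overall decomposition can be checked against the scalar double loop with a short C sketch. It assumes a vector length that is a multiple of four, 32-bit accumulation without saturation or rounding, and hypothetical function names; it mirrors the block structure (end-of-series term plus middle-of-series terms) rather than any SIMD register handling.

```c
#include <stdint.h>

/* Scalar reference: F[i] = sum over j = 0..i of R[j] * imp[i-j]. */
void correlate_scalar(const int16_t *R, const int16_t *imp, int32_t *F, int len)
{
    for (int i = 0; i < len; i++) {
        int32_t acc = 0;
        for (int j = 0; j <= i; j++)
            acc += (int32_t)R[j] * (int32_t)imp[i - j];
        F[i] = acc;
    }
}

/* Blocked form: four outputs per outer iteration. len must be a multiple of 4. */
void correlate_blocked(const int16_t *R, const int16_t *imp, int32_t *F, int len)
{
    for (int n = 0; n < len; n += 4) {
        int32_t y[4] = { 0, 0, 0, 0 };

        /* End-of-series term: newest R quad against imp[0..3], which is the
         * same for every block and so needs to be fetched only once. */
        for (int k = 0; k < 4; k++)
            for (int j = 0; j <= k; j++)
                y[k] += (int32_t)R[n + j] * (int32_t)imp[k - j];

        /* Middle-of-series terms: each earlier R quad against a window of imp
         * ending at index m; l steps by +4, m by -4, while (m - 6) > 0. */
        for (int l = 0, m = n + 3; (m - 6) > 0; l += 4, m -= 4)
            for (int k = 0; k < 4; k++)
                for (int j = 0; j < 4; j++)
                    y[k] += (int32_t)R[l + j] * (int32_t)imp[m - 3 + k - j];

        for (int k = 0; k < 4; k++)
            F[n + k] = y[k];
    }
}
```

Both functions produce identical output for any input whose length is a multiple of four; the blocked form simply regroups the same products so that each group of four fits one multiply-and-sum step.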

[0079] The following assembly code fragment is provided merely to illustrate an example of an implementation of the processing shown in FIGS. 4A and 4B. The example code is based on the SH5 instruction set. Various portions of the code are shown in bold text, underlined text, and italicized text to highlight the various operations shown in FIGS. 4A and 4B. The code highlighted in bold text performs the steps 402 to 410 corresponding to the matrix operation 302 in FIG. 3C. The code highlighted by the underlined text performs the steps 402 to 410 and steps 422 and 403 corresponding to outer loop operation 304 a of the matrix operation 304 (the outer loop). The code highlighted by the italicized text performs the steps 412 to decision step 401 corresponding to the inner loop operation 304 b of the matrix operation 304.

Example of Assembly Code for the SH5 Architecture

[0080] _obj_copy(x): copy contents of x into a register, do not modify x
_reg_int(): allocate a register
_label(): define a label, used as a jump target
_obj_memory(): indicate that memory has been modified

_code(
    "LT_PT %16,TR6 ; Load Target branch Reg 6"
    "LT_PT %17,TR7 ; Load Target branch Reg 7"
    "MOVI #27,%4 ; create control constant 0x1b in R27"
    "; for byte manipulation using permute instruction"
    "LD.Q %2,#0,%3 ; Load 4 words of the impulse response ImpResp[0,1,2,3]"
    "MOVI #16384,%18 ; Constant 0x4000 - value for rounding"
    "LD.Q %0,#0,%1 ; Load the residual excitation buffer RezBuf[0,1,2,3]"
    "MPERM.W %3,%4,%3 ; Reverse permute I[3 2 1 0]"
    "ADD %18,R63,%6 ; Move 0x4000 into accumulator (Reg 6)"
    "MEXTR2 R63,%1,%5 ; Extract the first word [0 0 0 R0]"
    "MMULSUM.WQ %3,%5,%6 ; (MAC) y0(%6) += [0 0 0 R0]*I[3 2 1 0]"
    "ADD %18,R63,%10 ; Move 0x4000 into second accumulator (Reg 10)"
    "MEXTR4 R63,%1,%5 ; Extract 2 words [0 0 R0 R1]"
    "MMULSUM.WQ %3,%5,%10 ; (MAC) y1 += [0 0 R0 R1]*I[3 2 1 0]"
    "ADD %18,R63,%11 ; Move 0x4000 into third accumulator"
    "MEXTR6 R63,%1,%5 ; Extract 3 words [0 R0 1 2]"
    "MMULSUM.WQ %3,%5,%11 ; (MAC) y2 += [0 R0 1 2]*[3 2 1 0]"
    "ADD %18,R63,%12 ; Move 0x4000 into third accumulator"
    "MMULSUM.WQ %3,%5,%10"
    "MMULSUM.WQ %3,%1,%12 ; (MAC) y3 += [R0 1 2 3]*[3 2 1 0]"
    "; Combine the results into 32 bit packed format."
    "MSHFLO.L %6,%10,%10 ; y[0,1]"
    "MOVI #15,%19 ; Right shift value"
    "MSHARD.L %10,%19,%10 ; scale down by 16"
    "MSHFLO.L %11,%12,%12 ; y[2,3]"
    "MSHARD.L %12,%19,%12 ;"
    "MCNVS.LW %10,%12,%12 ; Combine the above accumulators into y[0 1 2 3]"
    "ADD %3,R63,%9 ; copy [I3 2 1 0]"
    "ADD %0,R63,%13 ; copy of R start address"
    "ST.Q %7,#0,%12 ; Store y[](y7)"
    "ADDI %0,#112,%15 ; Get the address of R[56 57 56 55]"
    "ADDI %2,#8,%2 ; point to I[4 5 6 7]"
    "%16: ; loop point"
    "ADDI %0,#8,%0 ; point to next R (R[4 5 6 7])"
    "LD.Q %0,#0,%1 ; Load next quad (R[4 5 6 7])"
    "ADDI %7,#8,%7 ; point to next y"
    "; Initialize accumulators"
    "ADD %18,R63,%6 ; Move 0x4000 into yx"
    "ADD %18,R63,%10 ;"
    "ADD %18,R63,%11 ;"
    "ADD %18,R63,%12 ;"
    "; Computation for the end of the series for 4 output"
    "MEXTR2 R63,%1,%5 ; Extract End R ([0 0 0 R4])"
    "MMULSUM.WQ %9,%5,%6 ; y (y4) = End R ([0 0 0 R4]) * Start I ([3 2 1 0])"
    "MEXTR4 R63,%1,%5 ; Extract End R [0 0 R4 5]"
    "MMULSUM.WQ %9,%5,%10 ; y+1 (y5) = End R ([0 0 R4 5])*Start I ([3 2 1 0])"
    "MEXTR6 R63,%1,%5 ; Extract End R [0 R4 5 6]"
    "MMULSUM.WQ %9,%5,%11 ; y+2 (y6) = End R ([0 R4 5 6])*Start I ([3 2 1 0])"
    "MMULSUM.WQ %9,%1,%12 ; y+3 (y7) = End R ([R4 5 6 7])*Start I ([3 2 1 0])"
    "ADD %13,R63,%14 ; %14 current ‘R’ Address"
    "ADD %2,R63,%1 ; %1: Tmp end addr of I"
    "%17: ;"
    "; Computation of Quad mul-sums for the 4 outputs"
    "LD.Q %2,#0,%3 ; Load new I (I[4 5 6 7])"
    "LD.Q %2,#-8,%9 ; Load new-1 I (I[4 5 6 7])"
    "LD.Q %14,#0,%8 ; Load next R (R[0 1 2 3])"
    "MPERM.W %3,%4,%3 ; Reverse permute (I[7 6 5 4])"
    "MPERM.W %9,%4,%9 ; Reverse permute (I[7 6 5 4])"
    "; %9: Last I Q word loaded ([3 2 1 0])"
    "; %8: Last R Q word loaded ([0 1 2 3])"
    "MEXTR6 %3,%9,%5 ; Extract I LSH 1([4 3 2 1])"
    "MMULSUM.WQ %8,%5,%6 ; y (y4) += [R0 1 2 3]*[4 3 2 1]"
    "MEXTR4 %3,%9,%5 ; Extract I LSH 2([5 4 3 2])"
    "MMULSUM.WQ %8,%5,%10 ; Y (Y5) += [R0 1 2 3]*[5 4 3 2]"
    "MEXTR2 %3,%9,%5 ; Extract I LSH 3([6 5 4 3])"
    "MMULSUM.WQ %8,%5,%11 ; Y (Y6) += [R0 1 2 3]*[6 5 4 3]"
    "MMULSUM.WQ %8,%3,%12 ; y (y7) += [R0 1 2 3]*[7 6 5 4]"
    "ADDI %14,#8,%14 ; incr R ptr"
    "ADDI %2,#-8,%2 ; Decr I ptr"
    "BNE %14,%0,TR7 ; Loop to compute all quad mults"
    "; Combine the results into 32 bit packed format."
    "MSHFLO.L %6,%10,%10 ; y[0,1]"
    "MSHFLO.L %11,%12,%12 ; y[2,3]"
    "; scale down by 16"
    "MSHARD.L %10,%19,%10 ;"
    "MSHARD.L %12,%19,%12 ;"
    "MCNVS.LW %10,%12,%12 ; y[0 1 2 3]"
    "ADDI %1,#8,%2 ; Restore I ptr to next higher quad entry"
    "ST.Q %7,#0,%12 ; Store y (y7)"
    "BNE %14,%15,TR6 ; Loop for all set of 4 outputs",
    _obj_copy(RezBuf+4), _reg_int(), _obj_copy(ImpResp), _reg_int(),
    _reg_int(), _reg_int(), _reg_int(), _obj_copy(FltBuf[4]),
    _reg_int(), _reg_int(), _reg_int(), _reg_int(), _reg_int(), _reg_int(),
    _reg_int(), _reg_int(), _label(), _label(), _reg_int(), _reg_int(),
    _obj_memory());

[0081] FIG. 5A shows a generalized form of the matrix operations shown in FIGS. 2A-2C. Though the matrix operations in FIGS. 2A-2C are for a 4×4 matrix configuration, it can be appreciated that these operations can scale to larger matrix configurations; for example, a set of 8×8 matrix operations can be formulated. The subscripts used in the matrix operations shown in FIG. 5A are based on 2^(s), where s is a positive integer greater than one. It can be seen that the operations in FIGS. 2A-2C are defined by the operations shown in FIG. 5A for s=2.

[0082] FIG. 5B shows a further generalization of operations 504 and 506 shown in FIG. 5A to produce a generalized form of the operation 304 shown in FIG. 3C for computing the inner sum of products term. Here, the index n is incremented by 2^(s), and the index m is decremented by 2^(s).

[0083] It can be seen that the generalized form shown in FIG. 5B is suitable for 2^(s)-way parallel SIMD architectures. For example, where s=3, an 8-way SIMD machine can be used to implement the matrix operations. It is noted however, that an 8-way SIMD instruction set can be used to implement the 4×4 matrix operations shown in FIG. 3C. In such an implementation, each MAC operation can be performed on two sets of quad words.

[0084] Conversely, if a SIMD architecture provides for 2-way parallelism, it can be appreciated that the matrix operations are nonetheless suited for 2-way parallel operations, albeit requiring two operations to perform. For example, operations using a 4×4 matrix (i.e., FIG. 3C) would require two MAC instructions per vector multiplication of each row of the matrix. Thus, where the product: $\begin{bmatrix} 0 & 0 & 0 & R_{0} \\ 0 & 0 & R_{0} & R_{1} \\ 0 & R_{0} & R_{1} & R_{2} \\ R_{0} & R_{1} & R_{2} & R_{3} \end{bmatrix} \times \begin{bmatrix} I_{3} \\ I_{2} \\ I_{1} \\ I_{0} \end{bmatrix}$

[0085] would require four MAC operations to compute on a 4-way SIMD engine, the same product would require eight MAC operations to compute on a 2-way SIMD machine.

[0086] It is further noted that word size can determine the amount of parallelism attainable. Consider a 4-way SIMD, using 64-bit registers. A 16-bit data size results in a single MAC instruction per vector multiplication of a row in the matrix. However, an 8-bit data size would allow for two such multiplication operations to occur per MAC instruction. Conversely, a 32-bit data size would require two MAC instructions per matrix row.
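The relationship between register width, word size, and MAC count described above is simple integer arithmetic, sketched below in C. The helper names are hypothetical, and the sketch assumes that a matrix row which does not fill a register still costs one whole MAC instruction.

```c
/* Number of data words that fit in one SIMD register. */
int simd_lanes(int register_bits, int word_bits)
{
    return register_bits / word_bits;
}

/* MAC instructions needed to multiply one matrix row of row_len words,
 * rounding up when the row spans more than one register. */
int macs_per_row(int row_len, int lanes)
{
    return (row_len + lanes - 1) / lanes;
}
```

For a four-element row and 64-bit registers, 16-bit words give four lanes and one MAC per row, while 32-bit words give two lanes and two MACs per row, or eight MACs for the 4×4 product, matching the 2-way case described earlier. With 8-bit words there are eight lanes, so a single MAC has room for two four-element rows.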

[0087] It can be appreciated from the foregoing that varying degrees of parallelism and hence attainable performance gains can be achieved by a proper selection of SIMD parallelism and word size. The selection involves tradeoffs of available technology, system cost, performance goals such as speed, quality of synthesized speech, and the like. While such considerations may be particularly relevant to the specific implementation of the present invention, they are not germane to the invention itself.

[0088] The foregoing description of the present invention was presented using human speech as the source of the analog signal being processed. It is noted that this is merely for convenience of explanation. It can be appreciated that any form of analog signal of bandwidth within the sampling capability of the system can be subject to the processing disclosed herein, and that the term “speech” can therefore be expanded to refer to any such analog signals.

[0089] It can be further appreciated that the specific arrangement which has been described is merely illustrative of one implementation of an embodiment according to the principles of the invention. Numerous modifications may be made by those skilled in the art without departing from the true spirit and scope of the invention as set forth in the following claims.

What is claimed is:
 1. In a computer device for speech synthesis, a method for searching a codebook of excitation vectors to identify a selected excitation vector for CELP (code-excited linear prediction) coding comprising: computing a metric M_(i) based on an excitation vector v_(i); repeating the computing step for each excitation vector in the codebook; and identifying a minimum metric (M_(min)) from among the computed M_(i)'s, the excitation vector associated with M_(min) being the selected excitation vector, wherein the computing step includes computing a correlation quantity between a target vector signal and an impulse response comprising: accessing elements R_(i) of a first vector (R) stored in a first area of a memory component of the computer device and representative of the target vector signal; accessing elements I_(i) of a second vector (I) stored in a second area of the memory component and representative of the impulse response; computing a vector $\mathrm{F1} = \begin{bmatrix} 0 & \cdots & \cdots & 0 & R_{0} \\ \vdots & & & R_{0} & R_{1} \\ \vdots & & \ddots & & \vdots \\ 0 & R_{0} & & & \vdots \\ R_{0} & R_{1} & \cdots & \cdots & R_{2^{s}-1} \end{bmatrix} \times \begin{bmatrix} I_{2^{s}-1} \\ \vdots \\ \vdots \\ \vdots \\ I_{0} \end{bmatrix}$; and computing a vector $\mathrm{F2} = \sum_{n=2^{s}}^{\mathrm{Frm},\ \mathrm{step}\ 4} \left\{ \begin{bmatrix} 0 & \cdots & \cdots & 0 & R_{n} \\ \vdots & & & R_{n} & R_{n+1} \\ \vdots & & \ddots & & \vdots \\ 0 & R_{n} & & & \vdots \\ R_{n} & R_{n+1} & \cdots & \cdots & R_{n+(2^{s}-1)} \end{bmatrix} \times \begin{bmatrix} I_{2^{s}-1} \\ \vdots \\ \vdots \\ \vdots \\ I_{0} \end{bmatrix} + \sum_{\substack{m = n+(2^{s}-1) \\ l = 0}}^{\substack{m - 2 \times (2^{s}-1) > 0 \\ m,\ \mathrm{step}\ {-4} \\ l,\ \mathrm{step}\ 4}} \begin{bmatrix} I_{(m-(2^{s}-1))-(2^{s}-1)} & \cdots & I_{m-(2^{s}-1)} \\ \vdots & & \vdots \\ \vdots & & \vdots \\ I_{m-(2^{s}-1)} & \cdots & I_{m} \end{bmatrix} \times \begin{bmatrix} R_{l+(2^{s}-1)} \\ \vdots \\ \vdots \\ R_{l} \end{bmatrix} \right\}$,

where s>1 and Frm is a framesize, wherein the vectors F1 and F2 together are representative of the correlation quantity.
 2. The method of claim 1 wherein the metric M_(i) is defined by $\left( \frac{\left( d v_{i} \right)^{2}}{v_{i}^{T} \varphi v_{i}} \right)$,

where d is the correlation quantity and φ is a covariance matrix of the impulse response.
 3. The method of claim 1 wherein s=2.
 4. The method of claim 1 wherein the computing steps are performed by a central processing unit having a 2^(s)-way SIMD (single instruction multiple data) instruction set.
 5. The method of claim 1 wherein the computing steps are performed by a central processing unit having a 2^(s+1)-way SIMD (single instruction multiple data) instruction set.
 6. The method of claim 5 wherein the SIMD instruction set includes a multiply and accumulate (MAC) instruction, each of the matrix products [ . . . ]×[ . . . ] includes executing 2^(s−1) MAC instructions.
 7. The method of claim 1 wherein the computing steps are performed by a central processing unit having a 2^(t)-way SIMD (single instruction multiple data) instruction set, where t≠s.
 8. The method of claim 1 wherein the step of computing the vector F2 includes loading the elements I_(m−(2^(s)−1)) through I_(m) from the vector I into a first set of one or more registers in a central processing unit (CPU) of the computing device, wherein the elements I_((m−(2^(s)−1))−(2^(s)−1)) through I_((m−(2^(s)−1))+1) from the vector I will have been previously loaded into a second set of one or more registers in the CPU.
 9. A computer program product suitable for execution on a data processing device for use in a speech synthesis system, the data processing device supporting SIMD (single instruction multiple data) instructions comprising: computer readable media containing a computer program to select an excitation vector from a codebook containing a plurality of excitation vectors v, the computer program comprising: first computer program code to operate the data processing device to access from a first area of a memory component elements R_(i) of a vector R representative of a target vector signal; second computer program code to operate the data processing device to access from a second area of the computer memory component elements I_(i) of a vector I representative of an impulse response; third computer program code to operate the data processing device to access the excitation vectors v from the codebook, the codebook stored in a third area of the computer memory component; fourth computer program code to operate the data processing device to compute a metric M_(i) based on an excitation vector v_(i), including computing a vector F2 which is a portion of a correlation vector d representative of a correlation between the target vector signal and the impulse response, where vector $\mathrm{F2} = \sum_{n=2^{s}}^{\mathrm{Frm},\ \mathrm{step}\ 4} \left\{ \begin{bmatrix} 0 & \cdots & \cdots & 0 & R_{n} \\ \vdots & & & R_{n} & R_{n+1} \\ \vdots & & \ddots & & \vdots \\ 0 & R_{n} & & & \vdots \\ R_{n} & R_{n+1} & \cdots & \cdots & R_{n+(2^{s}-1)} \end{bmatrix} \times \begin{bmatrix} I_{2^{s}-1} \\ \vdots \\ \vdots \\ \vdots \\ I_{0} \end{bmatrix} + \sum_{\substack{m = n+(2^{s}-1) \\ l = 0}}^{\substack{m - 2 \times (2^{s}-1) > 0 \\ m,\ \mathrm{step}\ {-4} \\ l,\ \mathrm{step}\ 4}} \begin{bmatrix} I_{(m-(2^{s}-1))-(2^{s}-1)} & \cdots & I_{m-(2^{s}-1)} \\ \vdots & & \vdots \\ \vdots & & \vdots \\ I_{m-(2^{s}-1)} & \cdots & I_{m} \end{bmatrix} \times \begin{bmatrix} R_{l+(2^{s}-1)} \\ \vdots \\ \vdots \\ R_{l} \end{bmatrix} \right\}$,

s>1 and Frm is a framesize; and fifth computer program code to coordinate the first, second, third and fourth computer program codes to compute a metric for each excitation vector in the codebook and to identify a minimum metric therefrom, the excitation vector associated with the minimum metric being the selected excitation vector.
 10. The computer program product of claim 9 wherein the metric M_(i) is defined by $\left( \frac{\left( d v_{i} \right)^{2}}{v_{i}^{T} \varphi v_{i}} \right)$,

where φ is a covariance matrix of the impulse response.
 11. The computer program product of claim 9 further including additional computer program code to operate the data processing device to compute a vector F1, where vector $\mathrm{F1} = \begin{bmatrix} 0 & \cdots & \cdots & 0 & R_{0} \\ \vdots & & & R_{0} & R_{1} \\ \vdots & & \ddots & & \vdots \\ 0 & R_{0} & & & \vdots \\ R_{0} & R_{1} & \cdots & \cdots & R_{2^{s}-1} \end{bmatrix} \times \begin{bmatrix} I_{2^{s}-1} \\ \vdots \\ \vdots \\ \vdots \\ I_{0} \end{bmatrix}$,

wherein the vector F1 and the vector F2 together constitute the correlation vector d.
 12. The computer program product of claim 9 wherein s=2 and the SIMD instructions include a 4-way multiply and accumulate (MAC) instruction and each of the two matrix products [ . . . ]×[ . . . ] includes executing four MAC instructions.
 13. The computer program product of claim 9 wherein s=2 and the SIMD instructions include an 8-way multiply and accumulate (MAC) instruction and each of the two matrix product operations [ . . . ]×[ . . . ] includes executing two MAC instructions.
 14. A speech codec device comprising: a processing component supporting one or more single instruction multiple data (SIMD) instructions; a data storage component coupled to the processing component for transferring data therebetween; a first portion of the data storage component having stored therein a codebook of excitation vectors v; a second portion of the data storage component having stored therein a vector R representative of a target vector signal; a third portion of the data storage component having stored therein a vector I representative of an impulse response to a synthesis filter; and computer program code stored in the data storage component comprising a code portion suitable for execution on the processing component to compute a metric M_(i)= $\left( \frac{\left( d v_{i} \right)^{2}}{v_{i}^{T} \varphi v_{i}} \right)$

for an excitation vector v_(i), where φ is a covariance matrix of the impulse response and d is a correlation vector representative of a correlation between the target vector signal and the impulse response, the correlation vector d comprising a vector F1 and a vector F2, wherein vector $\mathrm{F1} = \begin{bmatrix} 0 & \cdots & \cdots & 0 & R_{0} \\ \vdots & & & R_{0} & R_{1} \\ \vdots & & \ddots & & \vdots \\ 0 & R_{0} & & & \vdots \\ R_{0} & R_{1} & \cdots & \cdots & R_{2^{s}-1} \end{bmatrix} \times \begin{bmatrix} I_{2^{s}-1} \\ \vdots \\ \vdots \\ \vdots \\ I_{0} \end{bmatrix}$ and vector $\mathrm{F2} = \sum_{n=2^{s}}^{\mathrm{Frm},\ \mathrm{step}\ 4} \left\{ \begin{bmatrix} 0 & \cdots & \cdots & 0 & R_{n} \\ \vdots & & & R_{n} & R_{n+1} \\ \vdots & & \ddots & & \vdots \\ 0 & R_{n} & & & \vdots \\ R_{n} & R_{n+1} & \cdots & \cdots & R_{n+(2^{s}-1)} \end{bmatrix} \times \begin{bmatrix} I_{2^{s}-1} \\ \vdots \\ \vdots \\ \vdots \\ I_{0} \end{bmatrix} + \sum_{\substack{m = n+(2^{s}-1) \\ l = 0}}^{\substack{m - 2 \times (2^{s}-1) > 0 \\ m,\ \mathrm{step}\ {-4} \\ l,\ \mathrm{step}\ 4}} \begin{bmatrix} I_{(m-(2^{s}-1))-(2^{s}-1)} & \cdots & I_{m-(2^{s}-1)} \\ \vdots & & \vdots \\ \vdots & & \vdots \\ I_{m-(2^{s}-1)} & \cdots & I_{m} \end{bmatrix} \times \begin{bmatrix} R_{l+(2^{s}-1)} \\ \vdots \\ \vdots \\ R_{l} \end{bmatrix} \right\}$,

where s>1 and Frm is a framesize, the computer program code further computing a plurality of the metrics M_(i) and identifying a minimum one of the metrics M_(min), wherein the excitation vector corresponding to M_(min) constitutes a selected excitation vector.
 15. The device of claim 14 wherein the one or more SIMD instructions provide N-way parallelism, wherein N and 2^(s) are related by a power of 2.
 16. The device of claim 14 wherein s=2.
 17. The device of claim 14 wherein the one or more SIMD instructions provide 4-way parallelism and s=2.
 18. The device of claim 14 wherein the one or more SIMD instructions provide 8-way parallelism and s=2, and wherein each of the three matrix products [ . . . ]×[ . . . ] includes executing two multiply and accumulate instructions.
 19. A speech synthesis device comprising: data processing means for performing single instruction multiple data (SIMD) operations, including a multiply and accumulate (MAC) operation; memory means, in data communication with the data processing means, for storing a vector R representative of a target vector signal, a vector I representative of an impulse response to a synthesis filter, and a codebook of excitation vectors v; and computer program code stored in the memory means comprising a code segment suitable for execution on the data processing means to compute a metric M_(i)= $\left( \frac{\left( d v_{i} \right)^{2}}{v_{i}^{T} \varphi v_{i}} \right)$

for an excitation vector v_(i), where φ is a covariance matrix of the impulse response and d is a correlation vector representative of a correlation between the target vector signal and the impulse response, the correlation vector d comprising a vector F1 and a vector F2, wherein vector $\mathrm{F1} = \begin{bmatrix} 0 & 0 & 0 & R_{0} \\ 0 & 0 & R_{0} & R_{1} \\ 0 & R_{0} & R_{1} & R_{2} \\ R_{0} & R_{1} & R_{2} & R_{3} \end{bmatrix} \times \begin{bmatrix} I_{3} \\ I_{2} \\ I_{1} \\ I_{0} \end{bmatrix}$ and vector $\mathrm{F2} = \sum_{n=4}^{\mathrm{Frm},\ \mathrm{step}\ 4} \left\{ \begin{bmatrix} 0 & 0 & 0 & R_{n} \\ 0 & 0 & R_{n} & R_{n+1} \\ 0 & R_{n} & R_{n+1} & R_{n+2} \\ R_{n} & R_{n+1} & R_{n+2} & R_{n+3} \end{bmatrix} \times \begin{bmatrix} I_{3} \\ I_{2} \\ I_{1} \\ I_{0} \end{bmatrix} + \sum_{\substack{m = n+(2^{s}-1) \\ l = 0}}^{\substack{m - 6 > 0 \\ m,\ \mathrm{step}\ {-4} \\ l,\ \mathrm{step}\ 4}} \begin{bmatrix} I_{m-6} & I_{m-5} & I_{m-4} & I_{m-3} \\ I_{m-5} & I_{m-4} & I_{m-3} & I_{m-2} \\ I_{m-4} & I_{m-3} & I_{m-2} & I_{m-1} \\ I_{m-3} & I_{m-2} & I_{m-1} & I_{m} \end{bmatrix} \times \begin{bmatrix} R_{l+3} \\ R_{l+2} \\ R_{l+1} \\ R_{l} \end{bmatrix} \right\}$,

where Frm is a framesize.
 21. The speech synthesis device of claim 19 wherein the MAC instruction is a 4-way parallel instruction.
 22. The speech synthesis device of claim 19 wherein the MAC instruction is an 8-way parallel instruction and each of the three matrix product operations [ . . . ]×[ . . . ] includes executing two MAC instructions.